Predicting Household SNAP Participation Rates by County in the Census South Region

Data Science for Public Policy Final Project

Author

Caroline Atuhaire, David Bluhm, Johnny Willing

Background and Literature Review

The Supplemental Nutrition Assistance Program (SNAP, formerly known as Food Stamps) provides nutritional support for households with low incomes across the United States. As one of the premier means-tested benefit programs provided by the federal government, it is a crucial part of the U.S. social safety net, lifting 3.4 million people out of poverty in 2023 (Creamer and King, 2024). Understanding the factors that contribute to SNAP participation can help researchers and policymakers determine strategies to ensure that families can receive the benefits they need. To that end, this project aims to build a model to predict SNAP household participation rates at the county level using data on county- and state-level demographic, economic, political, and policy characteristics for the 17 states in the Census South region: Delaware, District of Columbia, Florida, Georgia, Maryland, North Carolina, South Carolina, Virginia, West Virginia, Alabama, Kentucky, Mississippi, Tennessee, Arkansas, Louisiana, Oklahoma, and Texas.

Past research has identified numerous factors that influence SNAP participation rates. Negative economic conditions, such as high unemployment and poverty rates, are strongly correlated with SNAP participation, as are various demographic characteristics such as disability status, immigrant status, and lower educational levels (Pinard et al, 2017). Age is also an important factor, with eligible seniors enrolling in SNAP at much lower rates than eligible non-seniors (Jones et al, 2021).

Policy conditions have been found to play a significant role in determining SNAP participation. Broad-Based Categorical Eligibility (BBCE) is a policy that expands SNAP eligibility to households which qualify for Temporary Assistance for Needy Families (TANF) or a state maintenance of effort (MOE) funded benefit (FNS, n.d.a). States may choose to implement BBCE, effectively allowing them to raise the SNAP gross income limit to up to 200% of the federal poverty line and raise or eliminate the asset limit. Dickert-Conlin et al (2021) and Han (2016) find that states’ adoption of BBCE had significant positive effects on SNAP caseloads. Able-bodied adults without dependents (ABAWDs) are subject to work requirements to maintain SNAP eligibility. If an ABAWD is not working longer than three months in any three-year period, they lose eligibility for their benefits; however, states may request that this time limit be waived for part or all of the state, effectively eliminating work requirements (FNS, n.d.b). Dickert-Conlin et al (2021) find that work requirements are negatively associated with SNAP participation.

Data Sources

Packages for reading data and further analysis

library(tidyverse)
library(readxl)
library(janitor)
library(lubridate)
library(stringr)
library(tidycensus)
library(tigris)
library(sf)
library(ggplot2)
library(ranger)
library(tidymodels)
library(Metrics)
library(vip)

# Silence messages from tidycensus
options(tidycensus.quiet = TRUE)

SNAP Participation Rates

The numerator is taken from SNAP data tables published by the Food and Nutrition Service, specifically the Bi-Annual State Project Area/County Level Participation and Issuance Data published in a .zip file at this url: https://www.fns.usda.gov/pd/supplemental-nutrition-assistance-program-snap. The .zip file contains a set of excel files with counts of individuals and households participating in SNAP by county in January and July for each year from 1984 to 2023. We use a subset of those years, 2017-2019 and 2023. We choose to skip 2020-2022 because of the dramatically altered economic and policy environment during the COVID-19 pandemic which would likely make data from those years less applicable to other years. From this data, we take the average of the counts of households participating in SNAP in January and July, the two times of year this data is updated. A subset of counties are missing from the FNS dataset, so for those counties, we substitute data from the American Community Survey (ACS) 5-year estimates; table B22001 includes a variable for the number of households receiving SNAP benefits.

We take the denominator of the SNAP participation rate from the ACS B22001 table as well, as it also contains the total number of households by county. For the counties missing from the FNS dataset, we inflate the ACS data for each county since SNAP participation rates are often significantly underreported in household surveys such as the ACS (Rothbaum et al, 2021). We calculate the inflation factor at the state level, taking the average number of households receiving SNAP benefits in each state year from another FNS dataset (the .zip file under “National and/or State Level Monthly and/or Annual Data” from the url above) and dividing it by the number of households reported as receiving SNAP benefits in each state year in the ACS. For each county whose count of SNAP households came from the ACS rather than FNS, we multiply the household SNAP participation rate by the inflation factor to achieve a more accurate estimate.

counties_filepath <- "data/snap-zip-fns388a-4"
counties_filelist <- list.files(path = counties_filepath, pattern = "\\.xlsx?$", full.names = TRUE)
# function to read in SNAP county data files downloaded from FNS
read_snap_file <- function(file_path) {
  snap_df <- read_excel(file_path, skip = 3) |> clean_names()
  snap_df <- snap_df %>% select(-matches("^\\.\\.\\.|^x\\d+$"))  # drop empty columns

  file_name <- basename(file_path)
  file_month <- str_extract(file_name, "^[A-Za-z]+(?=\\s*\\d{4})")
  year <- str_extract(file_name, "\\d{4}(?=\\.)") %>% as.numeric()
  
  snap_df <- snap_df |>
    mutate(
      month = case_when(
        file_month == "JUL" ~ 7,
        file_month == "Jul" ~ 7,
        file_month == "July" ~ 7,
        TRUE ~ 1
      ),
      year = year,
      state_code = str_sub(substate_region, 1, 2),
      county_code = str_sub(substate_region, 3, 5),
      GEOID = str_sub(substate_region, 1, 5)
    ) %>%
    select(
      year, month, GEOID, state_code, county_code,
      calc_snap_total_pa_and_non_pa_households
    ) |>
    group_by(year, month, state_code, county_code, GEOID) |>
    summarize(fns_snap_households = sum(calc_snap_total_pa_and_non_pa_households)) |>
    filter(!str_detect(state_code, "\\D"))

  return(snap_df)
}
fns_snap_county <- map_dfr(counties_filelist, possibly(read_snap_file, otherwise = NULL)) %>%
  filter(year %in% c(2015,2016,2017,2018,2019,2023)) |>
  group_by(year, GEOID, state_code, county_code) |>
  summarize(fns_snap_households = round(mean(fns_snap_households)))
# function to read in SNAP state totals downloaded from FNS
read_state_totals <- function(file_path) {
  filename <- basename(file_path)
  FY <- as.numeric(str_extract(filename, "(?<=FY)\\d{2}"))
  FY <- 2000 + FY
  if (FY > 2019) {
    rows_list <- list(
      NERO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110"), 
      MARO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A98:B110", "A113:B125"), 
      SERO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110", "A113:B125"), 
      MWRO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110"), 
      SWRO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110"), 
      MPRO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110", "A113:B125"), 
      WRO = c("A8:B20", "A38:B50", "A68:B80", "A83:B95", "A98:B110", "A113:B125", "A128:B140"))
  
  } else if (FY %in% c(2015, 2016)) {
    rows_list <- list(
      NERO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110"), 
      MARO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A85:B97", "A115:B127"), 
      SERO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110", "A113:B125"), 
      MWRO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95"), 
      SWRO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80"), 
      MPRO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110", "A113:B125", "A128:B140", "A143:B155"), 
      WRO = c("A8:B20", "A25:B37", "A40:B52", "A70:B82", "A85:B97", "A100:B112", "A115:B127", "A130:B142"))

  } else {
    rows_list <- list(
      NERO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110"), 
      MARO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A85:B97", "A100:B112"), 
      SERO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110", "A113:B125"), 
      MWRO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110"), 
      SWRO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110"), 
      MPRO = c("A8:B20", "A23:B35", "A38:B50", "A53:B65", "A68:B80", "A83:B95", "A98:B110", "A113:B125"), 
      WRO = c("A8:B20", "A25:B37", "A55:B67", "A70:B82", "A85:B97", "A100:B112", "A115:B127"))
  }
  
  map_dfr(names(rows_list), function(sheet_name) {
    ranges <- rows_list[[sheet_name]]

    map_dfr(ranges, function(range) {
      df <- read_excel(file_path, sheet = sheet_name, range = range, col_names = FALSE)

      df <- df |>
        setNames(c("month", "households")) |>
        mutate(
          state_name = month[1],
          year = str_extract(month, "20\\d{2}"),
          month = str_sub(month, 1, 3)
        ) |>
        filter(!is.na(households))

      return(df)
    })
  })
}

state_filepath <- "data/snap_state_totals"
state_filelist <- list.files(path = state_filepath, pattern = "\\.xlsx?$", full.names = TRUE)

fns_state_totals <- map_dfr(state_filelist, read_state_totals) |>
  filter(!year %in% c(2014, 2020, 2022, 2024)) |>
  group_by(year, state_name) |>
  summarize(fns_snap_households = round(mean(households)))
# pull in SNAP data from ACS
acs_years <- list(2015, 2016, 2017, 2018, 2019, 2023)

acs_snap_county <- map_dfr(
  acs_years, 
  ~get_acs(
    geography = "county",
    variables = c(
      total_households = "B22001_001",
      acs_snap_households  = "B22001_002"
    ),
    year = .x,
    survey = "acs5",
    geometry = FALSE,
    cache_table = TRUE
  ) |> mutate(year = .x)) |>
  select(year, GEOID, NAME, variable, estimate) |>
  pivot_wider(names_from = variable, values_from = estimate) |>
  mutate(
    acs_snap_rate = acs_snap_households / total_households,
    state_fips = str_sub(GEOID, 1, 2)
  ) |>
  filter(!state_fips %in% c("60", "66", "69", "72", "78"))

acs_state_totals <- acs_snap_county |>
  group_by(year, state_fips) |>
  summarize(acs_snap_households = sum(acs_snap_households))

state_fips <- fips_codes |>
  select(state_code, state_name) |>
  distinct(state_code, .keep_all = TRUE)

acs_inflators <- fns_state_totals |>
  left_join(state_fips, by = "state_name") |>
  mutate(year = as.numeric(year)) |>
  left_join(acs_state_totals, by = c("state_code" = "state_fips", "year" = "year")) |>
  group_by(year, state_code) |>
  mutate(inflator = fns_snap_households/acs_snap_households) |>
  select(year, state_name, state_code, inflator)

acs_snap_county_infl <- acs_snap_county |>
  left_join(acs_inflators, by = c("state_fips" = "state_code", "year" = "year")) |>
  mutate(acs_snap_rate_infl = acs_snap_rate*inflator)

snap_participation <- acs_snap_county_infl |>
  full_join(fns_snap_county, by = c("GEOID", "year")) |>
  mutate(fns_snap_rate = fns_snap_households/total_households,
         snap_participation_rate = if_else(!is.na(fns_snap_rate), fns_snap_rate, acs_snap_rate_infl),
         snap_rate_acs = if_else(!is.na(fns_snap_rate), 0, 1)) |>
  select(year, GEOID, state_name, snap_participation_rate, snap_rate_acs, total_households)

us_counties <- counties(cb = TRUE) |>
  filter(!STATEFP %in% c("60", "66", "69", "72", "78", "02", "15"))
snap_map <- us_counties |>
  left_join(snap_participation, by = "GEOID")

snap_map |>
  filter(year == 2015) |>
  ggplot() +
  geom_sf(mapping = aes(fill = snap_participation_rate)) +
  scale_fill_gradient(low = "#cfe8f3", high = "#062635") +
  theme_void() +
  labs(title = "Household SNAP Participation by County in the Contiguous 48 States")

Demographic and Economic Characteristics

Census ACS Data

We also take demographic data from the ACS for the years 2015 to 2019 and 2023. Specifically, we take data for things like the number of white respondents, number of respondents that are renters, and the number of respondents in poverty, then divide them by the total number of people that were asked that question to calculate a percentage for each county. In all, we use ACS data to measure the percent male, median age, percent white, percent that are families, percent with a bachelor’s degree or more, percent in poverty, percent disabled, percent renting their home, median age, median household income, percent that are not citizens, and percent over 65 years old.

#Census Data
#######
census_vars <- tidycensus::load_variables(year = 2016,
                                          dataset = c("acs5"))

#B06011_001E is median income
#B19013_001 is median household income in past 12 months (colinearity w/median income?)
#B01001_001E is total population
#B01001_002E is number of males
#B17001_002E is number below poverty line
#B01002_001 is median age (total)
#B03002_003 is total number of non-hispanic white (_001 is overall)
#B15003_017 to _025 is education attainment levels
#B11001_002 is number of family households (_001 is overall)
#B18101_001 is total number disabled status
      #_004,_007, etc are number that are disabled for age/sex groups
#B25003_003 is total number of renters (_001 is overall)
#B22001_002 is total number of households that received food stamps in prior 12 months (_001 is overall)
#B05001_006 is total number of non-citizens (_001 is overall)
#B01001_020-25 and _044-049 is total over 65 (_001 is overall)


get_snap <- function(year) {
  
  available_vars <- load_variables(year, "acs5", cache = TRUE)$name
  
  snap_vars <- c("B06011_001E", "B01001_001E", "B01001_002E",
                      "B17001_002E", "B17001_001E", "B19013_001",
                      "B01002_001", "B03002_003", "B03002_001",
                      "B15003_001", "B15003_017", "B15003_018", "B15003_019",
                      "B15003_020", "B15003_021", "B15003_022",
                      "B15003_023", "B15003_024", "B15003_025",
                      "B11001_002", "B11001_001", "B25003_003",
                      "B25003_001", "B18101_001", "B22001_002", "B22001_001",
                 "B18101_004", "B18101_007", "B18101_010", "B18101_013",
                 "B18101_016", "B18101_019", "B18101_022", "B18101_025",
                 "B18101_028", "B18101_031", "B05001_006", "B05001_001",
                 "B01001_020", "B01001_021", "B01001_022", "B01001_023",
                 "B01001_024", "B01001_025", "B01001_044", "B01001_045",
                 "B01001_046", "B01001_047", "B01001_048", "B01001_049")
  
  snap_vars_nosuffix <- gsub("E$", "", snap_vars)
  
  usable_vars <- intersect(snap_vars_nosuffix, available_vars)
  
  if (length(usable_vars) == 0) {
    warning(paste0("No matching variables available for year ", year))
    return(NULL)
  }
  
  census_demo <- get_acs(geography = "county",
                         variables = usable_vars,
                         year = year,
                         survey = "acs5",
                         output = "tidy",
                         progress_bar = FALSE) |>
    mutate(year = year) |>
    mutate(variable = case_when(
      variable == "B06011_001" ~ "median_income_individual",
      variable == "B01001_001" ~ "total_population",
      variable == "B01001_002" ~ "male_population",
      variable == "B17001_002" ~ "below_poverty",
      variable == "B17001_001" ~ "poverty_total",
      variable == "B19013_001"  ~ "median_household_income",
      variable == "B01002_001"  ~ "median_age",
      variable == "B03002_003"  ~ "white_alone",
      variable == "B03002_001"  ~ "total_race_population",
      variable == "B15003_001"  ~ "total_educ",
      variable == "B15003_017"  ~ "hs_grad",
      variable == "B15003_018"  ~ "ged",
      variable == "B15003_019"  ~ "some_college_less_1yr",
      variable == "B15003_020"  ~ "some_college",
      variable == "B15003_021"  ~ "assoc_degree",
      variable == "B15003_022"  ~ "bachelors_degree",
      variable == "B15003_023"  ~ "masters_degree",
      variable == "B15003_024"  ~ "professional_degree",
      variable == "B15003_025"  ~ "doctoral_degree",
      variable == "B11001_002"  ~ "family_households",
      variable == "B11001_001"  ~ "total_households",
      variable == "B25003_003"  ~ "renter_occupied",
      variable == "B25003_001"  ~ "total_occupied_housing",
      variable == "B18101_001"  ~ "total_disability_universe",
      variable == "B22001_002"  ~ "snap_recipient_hh_prior12",
      variable == "B22001_001"  ~ "total_snap_universe",
      variable == "B05001_006" ~ "non_citizens",
      variable == "B05001_001" ~ "non_citizens_universe",
      TRUE ~ variable  # keep original name if no match
    ))
  
  census_wide <- census_demo |>
    select(GEOID, NAME, year, variable, estimate) |>
    pivot_wider(names_from = "variable",
                values_from = "estimate")
  
  census_wide <- census_wide |>
    mutate(male_pct = (male_population/total_population),
           white_pct = (white_alone/total_race_population),
           family_pct = (family_households/total_households),
           bachelor_or_higher_pct = ((bachelors_degree + masters_degree + 
                                        professional_degree + doctoral_degree)/
                                       total_educ),
           poverty_pct = (below_poverty/poverty_total),
           disabled_pct = ((B18101_004 + B18101_007 + B18101_010 + B18101_013 +
                             B18101_016 + B18101_019 + B18101_022 + B18101_025 +
                             B18101_028 + B18101_031)/total_disability_universe),
           snap_pct = (snap_recipient_hh_prior12/total_snap_universe),
           renter_pct = (renter_occupied/total_occupied_housing),
           non_citizen_pct = (non_citizens/non_citizens_universe),
           over_65_pct = ((B01001_020 + B01001_021 + B01001_022 + B01001_023 + 
                             B01001_024 + B01001_025 + B01001_044 + B01001_045 +
                             B01001_046 + B01001_047 + B01001_048 + B01001_049)/
                            total_population)) |>
    select(GEOID, NAME, year, male_pct, median_age, white_pct, family_pct, bachelor_or_higher_pct,
           poverty_pct, disabled_pct, snap_pct, renter_pct, median_age,
           median_income_individual, median_household_income, 
           snap_recipient_hh_prior12, non_citizen_pct, over_65_pct)
           
  
  assign(paste0("snap_", year), census_wide, envir = .GlobalEnv)
}

years <- c(2015, 2016, 2017, 2018, 2019, 2023)

map(.x = years, .f = get_snap)


combined_census <- bind_rows(snap_2015, snap_2016, snap_2017, snap_2018, 
                             snap_2019, snap_2023) |>
  arrange(GEOID)

USDA Rural-Urban Data

Another data source we use is the Rural Urban Continuum published by the USDA (available here: https://www.ers.usda.gov/data-products/rural-urban-continuum-codes). This data assigns a value to every county based on how rural (9 being the maximum) or urban (1 being the minimum) it is.

ruralurbancodes2013 <- read_excel("data/ruralurbancodes2013.xls")
ruralurbancodes2023 <- read_excel("data/Ruralurbancontinuumcodes2023.xlsx")

ruralurbancodes2013 <- ruralurbancodes2013 |>
  rename(GEOID = FIPS)

ruralurbancodes2023 <- ruralurbancodes2023 |>
  rename(GEOID = FIPS)

census_urban_rural <- combined_census |>
  left_join(ruralurbancodes2013 |>
              select(GEOID, RUCC_2013), by = "GEOID") |>
  left_join(ruralurbancodes2023 |>
              select(GEOID, RUCC_2023), by = "GEOID") |>
  mutate(rural_urban_score = case_when(
    year == 2023 ~ RUCC_2023,
    TRUE ~ RUCC_2013)) |>
  select(-RUCC_2013, -RUCC_2023)

University of Kentucky Welfare Data

A third source of demographic data comes from the University of Kentucky’s Center for Poverty Research, who have published national welfare data (available here: https://cpr.uky.edu/resources/national-welfare-data). From this data, we can collect information for each state like the unemployment rate, percent of residents that have varying degrees of food insecurity, the gross state product, whether the governor is a Democrat and the percentage of state legislatures that are Democrats, and the minimum wage. The political data has some quirks resulting in missingness, like D.C. having no governor and Nebraska having a unicameral legislature. To address this, we manually impute the political party for D.C.’s mayor and apply Nebraska’s legislature data to both chambers.

ky_welfare <- read_excel("data/UKCPR_National_Welfare_Data_1980_2023.xlsx", 
                         sheet = "Data", 
                         col_types = c("text", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "text", "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric", 
                                       "numeric", "numeric", "numeric")) |>
  janitor::clean_names()

ky_welfare <- ky_welfare |>
  filter(year %in% c(2015,2016,2017,2018,2019,2023)) |>
  mutate(state_fips = str_pad(state_fips, width = 2, side = "left", pad = "0")) |>
  select(state_fips, year, unemployment_rate, marginally_food_insecure, food_insecure,
         very_low_food_secure, gross_state_product, governor_is_democrat_1_yes,
         fraction_of_state_house_that_is_democrat, fraction_of_state_senate_that_is_democrat,
         state_minimum_wage)

census_urban_rural <- census_urban_rural |>
  mutate(state_fips = substr(GEOID, 1, 2))

demographic_combined <-  census_urban_rural |>
  left_join(ky_welfare, by = c("state_fips", "year")) 

#Filtering out Puerto Rico
demographic_combined <- demographic_combined |>
  filter(state_fips != 72)

#Manually imputing 1 to reflect DC's Democratic mayor each year
demographic_combined <- demographic_combined |>
  mutate(governor_is_democrat_1_yes = ifelse(GEOID == "11001", 1, governor_is_democrat_1_yes))

demographic_combined <- demographic_combined |>
  mutate(fraction_of_state_house_that_is_democrat = ifelse(
    GEOID == "11001", 1, fraction_of_state_house_that_is_democrat))

#Manually imputing Nebraska's unicameral makeup to both state house and senate pct
demographic_combined <- demographic_combined |>
  mutate(fraction_of_state_house_that_is_democrat = ifelse(
    state_fips == "31" & year %in% c(2015, 2016), 12/49, 
    fraction_of_state_house_that_is_democrat))

demographic_combined <- demographic_combined |>
  mutate(fraction_of_state_senate_that_is_democrat = ifelse(
    state_fips == "31" & year %in% c(2015, 2016), 12/49, 
    fraction_of_state_senate_that_is_democrat))

demographic_combined <- demographic_combined |>
  mutate(fraction_of_state_house_that_is_democrat = ifelse(
    state_fips == "31" & year %in% c(2017, 2018, 2019), 15/49, 
    fraction_of_state_house_that_is_democrat))

demographic_combined <- demographic_combined |>
  mutate(fraction_of_state_senate_that_is_democrat = ifelse(
    state_fips == "31" & year %in% c(2017, 2018, 2019), 15/49, 
    fraction_of_state_senate_that_is_democrat))

demographic_combined <- demographic_combined |>
  mutate(fraction_of_state_house_that_is_democrat = ifelse(
    state_fips == "31" & year %in% c(2023), 17/49, 
    fraction_of_state_house_that_is_democrat))

demographic_combined <- demographic_combined |>
  mutate(fraction_of_state_senate_that_is_democrat = ifelse(
    state_fips == "31" & year %in% c(2023), 17/49, 
    fraction_of_state_senate_that_is_democrat))

SNAP Policy Predictors

We include four variables covering state-level policies governing SNAP eligibility, three of which were constructed from the FNS’s SNAP Policy Database (available here: https://www.ers.usda.gov/data-products/snap-policy-data-sets) and one of which was entered manually. Specifically, our data contains: * A binary variable indicating whether a state implemented BBCE in a given year, taken directly from the SNAP Policy Database for 2015-2019. For 2023, this information was collected from FNS’s BBCE site (accessed for 2023 via the Wayback Machine here: https://web.archive.org/web/20230331232320/https://www.fns.usda.gov/snap/broad-based-categorical-eligibility). * A continuous variable that measures the state’s income limit for SNAP recipients as a percentage of the federal poverty line. For states that implement BBCE, the SNAP Policy Database contains this information for 2015-2019, with non-BBCE states coded as missing. We coded non-BBCE states’ as 130, because 130% is the default income limit for SNAP eligibility. For 2023, this information was collected from the FNS BBCE site linked above. * A categorical variable that equals 0 if states have the default federal asset limit for SNAP eligibility ($2,250 for most of the years in the sample), 1 if states have increased the asset limit via BBCE, and 2 if states have eliminated the asset limit entirely. This variable was constructed from the SNAP Policy Database for 2015-2019. For 2023, this information was collected from the FNS BBCE site linked above. * A categorical variable that equals 0 for states that have not waived the ABAWD work requirement time limit, 1 for states that have waived the time limit for part of the state, and 2 for states that have waived the time limit throughout the entire state. For 2015-2019, this data was collected by visually inspecting a map published by the Center on Budget and Policy Priorities showing areas covered by ABAWD time limit waivers for all 50 states (available here: https://www.cbpp.org/research/states-have-requested-waivers-from-snaps-time-limit-in-high-unemployment-areas-for-the-past). For DC, 2015-2019 ABAWD time limit waiver data was collected directly from FNS (available here: https://www.fns.usda.gov/snap/abawd/waivers/2015-2019). For 2023, this information was collected from the 2023 State Options Report (https://fns-prod.azureedge.us/sites/default/files/resource-files/snap-15th-state-options-report-october23.pdf).

poldb <- read_xlsx("data/SNAPPolicyDatabase.xlsx", sheet = "SNAP Policy Database")

abawd_waivers <- read_xlsx("data/SNAPPolicyDatabase.xlsx", sheet = "ABAWD Waivers")

poldb_2023 <- read_xlsx("data/SNAPPolicyDatabase.xlsx", sheet = "2023 Data")

mode <- function(x) {
  u <- unique(x)
  tab <- tabulate(match(x, u))
  u[tab == max(tab)]
}

combined_poldb <- poldb |>
  mutate(year = as.numeric(str_sub(as.character(yearmonth), 1, 4))) |>
  group_by(year, state_fips) |>
  summarize(bbce = mode(bbce),
            bbce_asset = mode(bbce_asset),
            bbce_inclmt = mode(bbce_inclmt)) |>
  group_by(year,state_fips) |>
    summarize(bbce = last(bbce),
            bbce_asset = last(bbce_asset),
            bbce_inclmt = last(bbce_inclmt)) |>
  left_join(abawd_waivers, by = c("state_fips", "year")) |>
  bind_rows(poldb_2023) |>
  filter(year > 2014,
         year != 2020) |>
  mutate(state_fips = as.character(state_fips),
         state_fips = case_when(
           state_fips == "1" ~ "01",
           state_fips == "2" ~ "02",
           state_fips == "4" ~ "04",
           state_fips == "5" ~ "05",
           state_fips == "6" ~ "06",
           state_fips == "8" ~ "08",
           state_fips == "9" ~ "09",
           TRUE ~ state_fips
         ))

Data Wrangling and Exploratory Data Analysis

Merging datasets

snap_2model <- snap_participation |>
  left_join(demographic_combined, by = c("GEOID", "year")) |>
  left_join(combined_poldb, by = c("state_fips", "year")) |>
  filter(snap_participation_rate < 1,
       state_name %in% c("Alabama", "Arkansas", "Delaware", "District of Columbia", "Florida", "Georgia", "Louisiana",
                         "Kentucky", "Louisiana", "Maryland", "Mississippi", "North Carolina", "Oklahoma", "South Carolina",
                         "Tennessee", "Texas", "Virginia", "West Virginia")) |> 
  select(-c(median_income_individual, state_name, NAME, snap_recipient_hh_prior12, snap_pct)) |>
  drop_na()

# Exclude counties that don't have data for all years
county_observations <- snap_2model |>
  group_by(GEOID) |>
  summarize(num_obs = n()) |>
  filter(num_obs < 6)

snap_2model <- snap_2model |>
  anti_join(county_observations, by = c("GEOID"))

# Remove extraneous dataframes to save memory
rm(abawd_waivers, acs_inflators, acs_snap_county, acs_snap_county_infl, acs_state_totals, acs_years, census_urban_rural, 
   census_vars, combined_census, combined_poldb, county_observations, demographic_combined, fns_snap_county, fns_state_totals,
   ky_welfare, poldb, poldb_2023, ruralurbancodes2013, ruralurbancodes2023, snap_2015, snap_2016, snap_2017, snap_2018, 
   snap_2019, snap_2023, snap_participation, state_fips, us_counties, snap_map)

Split up testing and training

We are building our predictive model using data from years 2017-2019 and 2023, training on 2017-2019 and testing on 2023. We do not use the years 2020-2022, as the COVID-19 pandemic dramatically altered the policy and economic landscape which could undermine the training of our model. We begin by splitting the data into training and testing sets, and then conducting exploratory data analysis.

snap_test <- snap_2model |>
  filter(year==2023)

snap_train <- snap_2model |>
  filter(year%in% c(2017, 2018, 2019)) |>
  arrange(year)

snap_train |> group_by(year) |> summarize(n = n())
# A tibble: 3 × 2
   year     n
  <dbl> <int>
1  2017  1420
2  2018  1420
3  2019  1420

Exploratory Data Analysis

We begin by simply examining average county SNAP participation rates by year and by state (for 2019) within the training dataset.

snap_train |>
  group_by(year) |>
  summarize(year_avg_rate = mean(snap_participation_rate, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = year, y = year_avg_rate)) +
  geom_line(size = 1.2, color = "darkblue") +
  geom_point(color = "darkred") +
  labs(
    title = "Southern State Average SNAP Participation Rate by Year (2017–2019)",
    x = "Year", y = "Average Participation Rate"
  ) +
  scale_x_continuous(breaks = 2017:2019) +
  theme_minimal()

Average county SNAP participation rates decreased slightly from 2017 to 2019, from about 21.5% to about 19%. Note that these are averages of county rates, not the overall rate for all counties, so they weight each county equally.

snap_train |>
  mutate(state_name = case_when(state_fips == "01" ~ "Alabama",
                                state_fips == "05" ~ "Arkansas",
                                state_fips == "10" ~ "Delaware",
                                state_fips == "11" ~ "District of Columbia", 
                                state_fips == "12" ~ "Florida", 
                                state_fips == "13" ~ "Georgia", 
                                state_fips == "21" ~ "Kentucky",
                                state_fips == "22" ~ "Louisiana",
                                state_fips == "24" ~ "Maryland", 
                                state_fips == "28" ~ "Mississippi",
                                state_fips == "37" ~ "North Carolina", 
                                state_fips == "40" ~ "Oklahoma", 
                                state_fips == "45" ~ "South Carolina",
                                state_fips == "47" ~ "Tennessee", 
                                state_fips == "48" ~ "Texas", 
                                state_fips == "51" ~ "Virginia", 
                                state_fips == "54" ~ "West Virginia")) |> 
  filter(year == 2019) |>
  group_by(state_name) |>
  summarize(state_avg_rate = mean(snap_participation_rate, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = state_name, y = state_avg_rate)) +
  geom_col() +
  labs(
    title = "Average SNAP Participation Rate by State 2019",
    x = "", y = "Average Participation Rate"
  ) +
  theme(axis.text.x = element_text(angle = 45, vjust = .75))

Average SNAP participation rates vary significantly across states, ranging from Virginia at just below 15%, to its neighbor West Virginia at nearly 25%.

Now we investigate the relationship between SNAP participation rates and various county-level economic conditions.

snap_train |> filter(year == 2019) |>
  ggplot(mapping = aes(x = median_household_income, y = snap_participation_rate)) + 
  geom_point(alpha = 0.25) +
  stat_smooth(method = "lm", 
              formula = y ~ x, 
              geom = "smooth") +
  labs(x = "Median Household Income", y = "SNAP Participation Rate", title = "SNAP Participation Rate by Household Income 2019") +
  theme_minimal()

snap_train |> filter(year == 2019) |>
  ggplot(mapping = aes(x = poverty_pct, y = snap_participation_rate)) + 
  geom_point(alpha = 0.25) +
  stat_smooth(method = "lm", 
              formula = y ~ x, 
              geom = "smooth") +
  labs(x = "Poverty Rate", y = "SNAP Participation Rate", title = "SNAP Participation Rate by Poverty Rate 2019") +
  theme_minimal()

snap_train |> filter(year == 2019) |>
  ggplot(mapping = aes(x = unemployment_rate, y = snap_participation_rate)) + 
  geom_point(alpha = 0.25) +
  stat_smooth(method = "lm", 
              formula = y ~ x, 
              geom = "smooth") +
  labs(x = "Unemployment Rate", y = "SNAP Participation Rate", title = "SNAP Participation Rate by Unemployment Rate 2019") +
  theme_minimal()

We see a strong negative association between SNAP participation rates and median household income, and a positive associations between SNAP participation rates and poverty and unemployment. The relationship between poverty and SNAP participation is especially strong, tightly hugging the line of best fit. These associations are not surprising, given SNAP’s status as a means-tested benefit program.

We now examine the relationship between SNAP and county demographic characteristics.

snap_train |> filter(year == 2019) |>
  ggplot(mapping = aes(x = white_pct, y = snap_participation_rate)) + 
  geom_point(alpha = 0.25) +
  stat_smooth(method = "lm", 
              formula = y ~ x, 
              geom = "smooth") +
  labs(x = "Percentage White", y = "SNAP Participation Rate", title = "SNAP Participation Rate by County Racial Demographics 2019") +
  theme_minimal()

snap_train |> filter(year == 2019) |>
  mutate(rural_urban_score = factor(rural_urban_score)) |>
  group_by(rural_urban_score) |>
  summarize(urban_avg_rate = mean(snap_participation_rate, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = rural_urban_score, y = urban_avg_rate)) +
  geom_col() +
  labs(
    title = "Average SNAP Participation Rate by Urbanicity",
    x = "Urban-Rural Score", y = "Average Participation Rate"
  ) +
  theme_minimal()

Counties with a higher percentage of white residents tend to participate less in SNAP. Urbanicity does not have a simple linear relationship with SNAP participation. The most urban counties (those with an urban-rural score of 1) have the lowest rates of SNAP participation, and the highest levels are seen in suburban counties, with the most rural counties (with a score of 9) in the middle.

Finally, we explore the relationship between SNAP participation and our selected policy variables.

snap_train |> filter(year == 2019) |>
  group_by(bbce) |>
  mutate(bbce = if_else(bbce == 0, "No", "Yes")) |>
  summarize(bbce_avg_rate = mean(snap_participation_rate, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = bbce, y = bbce_avg_rate)) +
  geom_col() +
  labs(
    title = "Average SNAP Participation Rate by BBCE Participation",
    x = "BBCE", y = "Average Participation Rate"
  ) +
  theme_minimal()

snap_train |> filter(year == 2019) |>
  mutate(bbce_inclmt = if_else(bbce_inclmt == -9, 130, bbce_inclmt),
        bbce_inclmt = factor(bbce_inclmt)) |>
  group_by(bbce_inclmt) |>
  summarize(inclt_avg_rate = mean(snap_participation_rate, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = bbce_inclmt, y = inclt_avg_rate)) +
  geom_col() +
  labs(
    title = "Average SNAP Participation Rate by Income Limit",
    x = "SNAP Income Limit as Percent of Federal Poverty Line", y = "Average Participation Rate"
  ) +
  theme_minimal()

snap_train |> filter(year == 2019) |>
  mutate(abawdwaiver= case_when(
    abawdwaiver == 0 ~ "No waivers",
    abawdwaiver == 1 ~ "Waivers Covering Part of State",
    abawdwaiver == 2 ~ "Waivers for Entire State"
  )) |>
  group_by(abawdwaiver) |>
  summarize(waiver_avg_rate = mean(snap_participation_rate, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = abawdwaiver, y = waiver_avg_rate)) +
  geom_col() +
  labs(
    title = "Average SNAP Participation Rate by Waiver Status",
    x = "ABAWD Time Limit Waivers", y = "Average Participation Rate"
  ) +
  theme_minimal()

We don’t see any difference in SNAP participation rates between states who implement BBCE and those who don’t. Moreover, SNAP participation rates do not seem to have a simple linear association with income limits for eligibility. Perhaps there are other variables that confound the relationships between BBCE, the income limit, and SNAP participation, such as poverty. ABAWD time limit waivers, unsurprisingly, have a positive association with SNAP participation, suggesting that these waivers make it easier for people to receive SNAP benefits.

Data Analysis

Training models

Creating folds and a recipe

Now that we have explored our training data set, we will begin training our models. The first step is setting up resamples: for the three training years, we have two resamples: the first trains on 2017 and tests on 2018, and the second trains on 2018 and tests on 2019. The final fit will ultimately be tested on 2019, the closest year to our testing year.

# set up folds
snap_folds <- rolling_origin(snap_train, initial = 1420, assess = 1420, cumulative = FALSE)

snap_2019 <- snap_train |>
  filter(year==2019)

We then pre-process our data. We remove predictors with zero variance, update the BBCE-related variables as described in the data sources section above, convert all nominal predictors into dummy variables, and normalize all numeric predictors.

# create recipe
snap_rec <- 
  recipe(formula = snap_participation_rate ~ ., data = snap_train) |>
    step_zv(all_predictors()) |>
    update_role(GEOID, new_role = "id variable") |>
    step_mutate(bbce_inclmt = if_else(bbce_inclmt == -9, 130, bbce_inclmt),
              bbce_asset = case_when(
                bbce_asset == 1 ~ 2,
                bbce_asset == 0 ~ 1,
                bbce_asset == -9 ~ 0
              ),
              bbce_asset = factor(bbce_asset),
              abawdwaiver = factor(abawdwaiver)) |>
  step_normalize(all_numeric_predictors(), -c(snap_rate_acs, rural_urban_score, governor_is_democrat_1_yes, bbce)) |>
  step_dummy(all_nominal_predictors())

Linear Regression Elastic Net

Our first model to test is an elastic net. We tune the penalty hyperparameter to find the value that minimizes RMSE.

lm_spec <- linear_reg(penalty = tune(),
                      mixture = .5) |>
  set_engine("glmnet") |>
  set_mode("regression")

lm_grid <- grid_regular(penalty(), levels = 10)

lm_wf <- workflow() |>
  add_recipe(snap_rec) |>
  add_model(lm_spec)

lm_res <- lm_wf |>
  tune_grid(resamples = snap_folds,
            grid = lm_grid,
            metrics = metric_set(yardstick::rmse, yardstick::mae))

lm_res |>
  collect_metrics()
# A tibble: 20 × 7
         penalty .metric .estimator   mean     n    std_err .config             
           <dbl> <chr>   <chr>       <dbl> <int>      <dbl> <chr>               
 1 0.0000000001  mae     standard   0.0317  1421 0.0000311  Preprocessor1_Model…
 2 0.0000000001  rmse    standard   0.0423  1421 0.0000327  Preprocessor1_Model…
 3 0.00000000129 mae     standard   0.0317  1421 0.0000311  Preprocessor1_Model…
 4 0.00000000129 rmse    standard   0.0423  1421 0.0000327  Preprocessor1_Model…
 5 0.0000000167  mae     standard   0.0317  1421 0.0000311  Preprocessor1_Model…
 6 0.0000000167  rmse    standard   0.0423  1421 0.0000327  Preprocessor1_Model…
 7 0.000000215   mae     standard   0.0317  1421 0.0000311  Preprocessor1_Model…
 8 0.000000215   rmse    standard   0.0423  1421 0.0000327  Preprocessor1_Model…
 9 0.00000278    mae     standard   0.0317  1421 0.0000311  Preprocessor1_Model…
10 0.00000278    rmse    standard   0.0423  1421 0.0000327  Preprocessor1_Model…
11 0.0000359     mae     standard   0.0317  1421 0.0000311  Preprocessor1_Model…
12 0.0000359     rmse    standard   0.0423  1421 0.0000327  Preprocessor1_Model…
13 0.000464      mae     standard   0.0318  1421 0.0000270  Preprocessor1_Model…
14 0.000464      rmse    standard   0.0424  1421 0.0000284  Preprocessor1_Model…
15 0.00599       mae     standard   0.0331  1421 0.0000104  Preprocessor1_Model…
16 0.00599       rmse    standard   0.0443  1421 0.00000899 Preprocessor1_Model…
17 0.0774        mae     standard   0.0559  1421 0.00000806 Preprocessor1_Model…
18 0.0774        rmse    standard   0.0708  1421 0.0000182  Preprocessor1_Model…
19 1             mae     standard   0.0709  1421 0.0000215  Preprocessor1_Model…
20 1             rmse    standard   0.0898  1421 0.0000428  Preprocessor1_Model…
lm_res |>
  autoplot()

K-Nearest Neighbor

Our next model is K-Nearest Neighbors. Here, we tune the model to select the number of neighbors that minimizes RMSE.

knn_spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine(engine = "kknn") |>
  set_mode(mode = "regression")

knn_wf <- workflow() |>
  add_model(spec = knn_spec) |>
  add_recipe(recipe = snap_rec)

knn_grid <- grid_regular(neighbors(range = c(1, 15)), levels = 8)

knn_res <- knn_wf |>
  tune_grid(resamples = snap_folds, 
            grid = knn_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(yardstick::rmse, yardstick::rsq))

knn_res |>
  collect_metrics()
# A tibble: 16 × 7
   neighbors .metric .estimator   mean     n  std_err .config             
       <int> <chr>   <chr>       <dbl> <int>    <dbl> <chr>               
 1         1 rmse    standard   0.0348  1421 0.000261 Preprocessor1_Model1
 2         1 rsq     standard   0.875   1421 0.00255  Preprocessor1_Model1
 3         3 rmse    standard   0.0331  1421 0.000207 Preprocessor1_Model2
 4         3 rsq     standard   0.875   1421 0.00204  Preprocessor1_Model2
 5         5 rmse    standard   0.0351  1421 0.000172 Preprocessor1_Model3
 6         5 rsq     standard   0.858   1421 0.00174  Preprocessor1_Model3
 7         7 rmse    standard   0.0364  1421 0.000150 Preprocessor1_Model4
 8         7 rsq     standard   0.848   1421 0.00150  Preprocessor1_Model4
 9         9 rmse    standard   0.0375  1421 0.000136 Preprocessor1_Model5
10         9 rsq     standard   0.839   1421 0.00132  Preprocessor1_Model5
11        11 rmse    standard   0.0386  1421 0.000128 Preprocessor1_Model6
12        11 rsq     standard   0.832   1421 0.00119  Preprocessor1_Model6
13        13 rmse    standard   0.0396  1421 0.000121 Preprocessor1_Model7
14        13 rsq     standard   0.824   1421 0.00108  Preprocessor1_Model7
15        15 rmse    standard   0.0405  1421 0.000116 Preprocessor1_Model8
16        15 rsq     standard   0.818   1421 0.000995 Preprocessor1_Model8
knn_res |>
  autoplot()

knn_best <- knn_res |>
  select_best(metric = "rmse")

Random Forest

Our final model is a random forest.

set.seed(05072025)

rf_spec <- rand_forest(
  trees = 100,
  min_n = 5) |>
  set_mode("regression") |>
  set_engine("ranger")

rf_wf <- workflow() |>
  add_model(rf_spec) |>
  add_recipe(snap_rec)

rf_res <- rf_wf |>
  fit_resamples(resamples = snap_folds)

rf_res |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator   mean     n   std_err .config             
  <chr>   <chr>       <dbl> <int>     <dbl> <chr>               
1 rmse    standard   0.0382  1421 0.0000210 Preprocessor1_Model1
2 rsq     standard   0.843   1421 0.000269  Preprocessor1_Model1

Testing Final Model

The K-Nearest Neighbors model with three neighbors was the best performing model, with an RMSE of about 3.3 percentage points. So we move forward and fit that model on data from 2019 and test it on 2023.

knn_best <- knn_res |>
  select_best(metric = "rmse")

final_fit <- knn_wf |>
  finalize_workflow(parameters = knn_best) |>
  fit(data = snap_2019)

predictions <- bind_cols(
  snap_test,
  predict(object = final_fit, new_data = snap_test))

select(predictions, snap_participation_rate, .pred)
# A tibble: 1,420 × 2
   snap_participation_rate  .pred
                     <dbl>  <dbl>
 1                   0.162 0.126 
 2                   0.109 0.0985
 3                   0.328 0.268 
 4                   0.231 0.207 
 5                   0.152 0.134 
 6                   0.383 0.300 
 7                   0.360 0.311 
 8                   0.243 0.213 
 9                   0.264 0.253 
10                   0.185 0.180 
# ℹ 1,410 more rows
rmse(predictions$snap_participation_rate, predictions$.pred)
[1] 0.05624878
summary(snap_test$snap_participation_rate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.1278  0.1826  0.2006  0.2588  0.6196 

Discussion of Results

Error Rates

When implemented on the testing data, the RMSE was 5.62 percentage points, which is significantly higher than the average RMSE of the best model during cross-validation. Compared to the average SNAP participation rate across sampled counties in 2023 of 20.06%, this is a very high error rate, about 28.02%. Given the magnitude of the RMSE, we would not recommend the implementation of this model for making real-world predictions. The error rate also varied across states and across different levels of urbanicity and racial make-up. The first bar graph below shows the error rates by state, which vary dramatically. Texas has by far the highest RMSE at over 12 percentage points, and Washington D.C. has the lowest. State size appears to be a factor contributing to this; states with fewer counties seem to have lower error rates than states with many counties (D.C. only has one, while Texas has 254).

predictions |> 
  mutate(state_name = case_when(state_fips == "01" ~ "Alabama",
                                state_fips == "05" ~ "Arkansas",
                                state_fips == "10" ~ "Delaware",
                                state_fips == "11" ~ "District of Columbia", 
                                state_fips == "12" ~ "Florida", 
                                state_fips == "13" ~ "Georgia", 
                                state_fips == "21" ~ "Kentucky",
                                state_fips == "22" ~ "Louisiana",
                                state_fips == "24" ~ "Maryland", 
                                state_fips == "28" ~ "Mississippi",
                                state_fips == "37" ~ "North Carolina", 
                                state_fips == "40" ~ "Oklahoma", 
                                state_fips == "45" ~ "South Carolina",
                                state_fips == "47" ~ "Tennessee", 
                                state_fips == "48" ~ "Texas", 
                                state_fips == "51" ~ "Virginia", 
                                state_fips == "54" ~ "West Virginia")) |>
  group_by(state_fips) |>
  mutate(state_rmse = rmse(snap_participation_rate, .pred)) |>
  ggplot() +
  geom_col(mapping = aes(x = state_name, y = state_rmse)) +
  theme(axis.text.x = element_text(angle = 45, vjust = .85)) + 
  labs(x = "State", y = "RMSE", title = "RMSE by State")

The second bar graph shows error rates by urbanicity (with 1 being most urban and 9 being the most rural). It appears that counties at either end of the urban/rural spectrum had the highest RMSE, while counties in the middle had the lowest.

predictions |> 
  group_by(rural_urban_score) |>
  mutate(rural_urban_rmse = rmse(snap_participation_rate, .pred)) |>
  ggplot() +
  geom_col(mapping = aes(x = rural_urban_score, y = rural_urban_rmse)) +
  theme(axis.text.x = element_text(angle = 45, vjust = .85)) + 
  labs(x = "Urban-Rural Score", y = "RMSE", title = "RMSE by Urbanicity")

The scatter plot and regression line below display the relationship between absolute error and county racial make-up. Counties with a higher non-white proportion of the population had slightly higher error rates.

predictions |>
  mutate(abs_error = abs(.pred - snap_participation_rate)) |>
  ggplot(mapping = aes(x = white_pct, y = abs_error)) + 
  geom_point() +
  stat_smooth(method = "lm", 
              formula = y ~ x, 
              geom = "smooth") +
  labs(x = "Proportion of the Population that is White", y = "Error", title = "Error by County Racial Demographics")

Differential error rates across different sectors of the population are crucial for researchers policymakers to consider when evaluating a model, as they may have ramifications on equity if policy decisions are made using the outputs of the model.

Limitations

The primary limitations of this project were time and computing power. Only having access to our personal laptops, we were not able to use samples as large or models as complex as we would have preferred. This likely contributed to the magnitude of the error rate, and potentially to the differential error rates, as we did not have sample sizes large enough to make accurate predictions for certain subsets of the population.

Bibliography

Creamer, J., & King, M. D. (2024, November 14). New interactive data tool shows how programs and expenses affect poverty measurement. U.S. Census Bureau. https://www.census.gov/library/stories/2024/11/supplemental-poverty-measure-visualization.html

Dickert-Conlin, S., Fitzpatrick, K., Stacy, B., & Tiehen, L. (2021). The downs and ups of the SNAP caseload: What matters? Applied Economic Perspectives and Policy, 43(3), 1026–1050.

Han, J. (2016). The impact of SNAP on material hardships: Evidence from broad-based categorical eligibility expansions. Southern Economic Journal, 83(2), 464–486.

Jones, J., Courtemanche, C., Denteh, A., Marton, J., & Tchernis, R. (2021). Do state SNAP policies influence program participation among seniors? IZA Discussion Paper Series, No. 14564. Institute of Labor Economics (IZA).

Pinard, C. A., Bertmann, F. M. W., Byker Shanks, C., Scharadin, B., & Yaroch, A. L. (2017). What factors influence SNAP participation? Literature reflecting enrollment in food assistance programs from a social and behavioral science perspective. Journal of Hunger & Environmental Nutrition, 12(2), 151–168.

Rothbaum, J., Fox, L., & Shantz, K. (2021, October). Fixing errors in a SNAP: Addressing SNAP under-reporting to evaluate poverty. U.S. Census Bureau.

U.S. Department of Agriculture, Food and Nutrition Service. (n.d.a). Broad-based categorical eligibility. https://www.fns.usda.gov/snap/broad-based-categorical-eligibility

U.S. Department of Agriculture, Food and Nutrition Service. (n.d.b). ABAWD Waivers. https://www.fns.usda.gov/snap/abawd/waivers